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Q ^ Abstract. The rates-across-sites assumption in phylogenetic inference posits that the 

■ rate matrix governing the Markovian evolution of a character on an edge of the putative 
phylogenetic tree is the product of a character-specific scale factor and a rate matrix that is 
particular to that edge. Thus, evolution follows basically the same process for all characters, 
except that it occurs faster for some characters than others. To allow estimation of tree 

I— I ■ topologies and edge lengths for such models, it is commonly assumed that the scale factors 

are not arbitrary unknown constants, but rather unobserved, independent, identically dis- 
tributed draws from a member of some parametric family of distributions. A popular choice 
Q I is the gamma family. We consider an example of a clock-like tree with three taxa, one un- 

' ^ ■ known edge length, a known root state, and a parametric family of scale factor distributions 

I ' that contain the gamma family. This model has the property that, for a generic choice of 

unknown edge length and scale factor distribution, there is another edge length and scale 
\ factor distribution which generates data with exactly the same distribution, so that even 

I with infinitely many data it will be typically impossible to make correct inferences about 

^ I the unknown edge length. 

^ ! 
O ■ 
00 ■ 

o : 

^ ! 1. Introduction 

"o : , , 

■ Beginning with the germinal work |Fel78j . statistically-based estimations of phylogenetic 

Q^! trees have become popular in molecular systematics, with Bayesian |HR01j and maximum 

. > • likelihood methods |Swo96[ 1(4(4031 IPMnOl ILew98[ l()MH()94j used with increasing frequency. 
V ■ 

■ Such statistically-based methods assume that the observed sequences are the result of a sto- 
chastic process that has operated on a tree, and they make assumptions about the stochastic 
process (that is, model) that has produced the data. 

A fundamental question about any statistical model is whether it is identifiable: that is, 
whether different parameter values lead to different probability distributions for the data, so 
that, in particular, there is some hope of estimating the parameters with increasing accuracy 
as the amount of data increases. These questions have been investigated extensively for 
certain models used in phylogenetic inference (see, for example, jSte94[ ICha96j ). 
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Many models used in phylogenetic inference combine a (typically parameter rich) model 
of individual site evolution with the assumption that the different sites evolve under a rates- 
acrosssites model, so that each site c has an associated rate of evolution which is constant 
across the tree. Thus, sites evolve under essentially the same evolutionary process, and are 
just scaled up (or down) versions of each other. (Thus, the rates-across-sites assumption 
implies that if one site is expected to evolve twice as fast as another site on edge e, then it 
is expected to evolve twice as fast as the other site on every edge.) 

Such rates-across-sites models in which each character has its own unknown scale factor 
are discussed in ISUWHOG] , but these models still pose difficult inferential problems. As 
remarked in Chapter 13 of |Fel04j : 

As the number of sites increases, the number of parameters being estimated 
rises correspondingly. This is worrisome: in such "infinitely many parame- 
ters" cases maximum likelihood often misbehaves and fails to converge to the 
correct tree as the number of sites increases. 

Indeed, our own example below shows that relative edge-lengths are, in general, unidentifiable 
for such models. (We discuss a situation in which the unknown scale parameters for the 
respective characters are unobserved, independent, identically distributed, realizations of 
some distribution belonging to a particular family of distributions. However, if edge-lengths 
are not identifiable in our set-up, then they certainly won't be identifiable in the analogous 
set-up where the scale parameters are arbitrary.) 

A popular 'fix' that has been proposed for this problem is to adopt a random effects 
approach and suppose that the successive scale factors Vc are unobserved, independent ran- 
dom draws from a member of some parametric family of distributions. This reduces the 
dimensionality of the problem by replacing the deterministic sequence of parameters with 
the small number of parameters that describe the generating distribution (see, for example, 
lUnTTl lNnF76l lUESTl IHKY871 lYan96jV 

A standard choice of distribution for the random scale factors is the two-parameter family 
of gamma distributions. This family has the mathematical advantage that likelihoods still 
have analytically tractable closed forms, and it was shown for a wide class of substitution 
models in |Rog01| that edge-lengths are identifiable in this setting. The choice of the gamma 
family is often supported by claims that it is sufficiently fiexible to mimic the variation of 
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rates between characters that is hkely to be seen 'in practice'. There also appears to be a 
general sense among many practitioners that the choice of distributional family for the scale 
factors is primarily a matter of convenience and that, provided the family is rich enough, 
substantially correct inferences of relative edge-lengths will be possible with sufficient data. 
To our knowledge, there is no argument in any setting justifying why an assumption of an 
exact gamma distribution for the scale factors is biologically reasonable. As remarked in 
(MOil: 

There is nothing about the gamma distribution that makes it more biologically 
realistic than any other distribution, such as the lognormal. It is used because 
of its mathematical tractability. 

It was shown in |SSH94j that the use of random scale factors might not be completely 
without problems. In their paper they gave an example of a specific choice of edge-lengths for 
each tree topology and a specific choice of discrete distribution for the scale factors (rather 
than a continuous distribution such as a gamma) such that the resulting distribution for the 
data under the Neyman two-state model is the same for all tree topologies. 

In this paper we go further, at least in some directions. We consider a rooted tree with three 
taxa and one unknown edge-length (with the remaining edge-lengths either known or fixed 
by the clock-like constraint that all lineages have the same total length), and a particular ten- 
parameter family T of scale factor distributions with a certain nine-parameter sub-family Q 
of T . We show that for a generic choice of unknown edge-length r and model in G G ^ there 
is a choice of edge-length a ^ r and model in F G with the property that data generated 
according to the Neyman 2-state model with known ancestral state, scale factor distributed 
according to G, and edge-length r, has the same distribution as data generated according 
to a Neyman 2-state model with the same ancestral state, scale factor distributed according 
to -F, and edge-length o. Thus, even with infinitely many data, it would be impossible to 
decide whether the unknown edge-length is cr or r - even if one somehow knew in advance 
that the distribution for the scale parameter was one of either F or G. Here the term generic 
means that the set of exceptional models G and edge-lengths r for which no corresponding 
F and a exist is a lower dimensional subset of ^ x M_,_. In particular, the set of G and r that 
have corresponding F and a is an everywhere dense open subset of ^ x R_^. 
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Moreover, the family T consists of distributions with smooth, unimodal, densities that 
possess moments of all orders. In this sense, each distribution in JF is as "nice" as a gamma 
distribution. Of course, the two-parameter family of gamma distributions is simpler than 
than the ten-parameter family T . However, the use of T is essentially a technical device 
in our analysis. We could have described our results by simply saying that for a generic 
unknown edge- length r there is a corresponding edge-length o" 7^ r, and two scale parameter 
distributions G and F, such that data generated according to the Neyman 2-state model 
with known ancestral state, scale factor distributed according to G, and edge-length r, 
has the same distribution as data generated according to a Neyman 2-state model with 
the same ancestral state, scale factor distributed according to F, and edge-length a. In 
particular, unidentifiability is not inherently a case of "over-parametrization" : the effect 
can be produced when we have just a finite number of possible parameter values and is 
not produced by having a continuous space of possible parameter values with too high a 
dimension. We have included the mention of the families JF and Q in the description of our 
results to stress that the unidentifiability problem is, in some sense, generic. 

The family Q (and hence contains all the gamma distributions as a subfamily. Any 
gamma distribution and any edge-length t will have a distribution G & Q and r arbitrarily 
close to them such that there is a corresponding F e and cr 7^ r as above. 

Our example applies not only to the Neyman model but also to any model such as the 
binary General Time-Reversible model that contains the Neyman model as a sub-model. 
Furthermore, our example applies to the General Time-Reversible on an arbitrary finite 
state-space, because one can choose the substitution rate matrices for such a model to 
be sufficiently symmetric so that a suitable many-to-one binary encoding of the model is 
Markovian and evolves according to the Neyman model. 

We should point out that we construct our example using a perturbative technique. Con- 
sequently, the edge-lengths r and a that arise will be "close" to each other. However, our 
analysis doesn't rule out the possibility that a similar example could be produced with 
edge-lengths that are "far apart". In order to fully assess the practical implications of the 
phenomenon we have observed, further research is necessary to quantify just how distant 
two edge-lengths can be and still have corresponding scale parameter distributions that lead 
to identical distributions for the data. Moreover, this is not a purely mathematical question. 
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because the notions of "close" and "far apart" are dependent on the scientific question being 
investigated with a particular data set. 

Also, we note that if we actually knew the distribution of the scale parameter in our 
three taxa example, then the unknown edge-length could be recovered uniquely from the 
distribution of the data, and this is so for an arbitrary scale parameter distribution, not 
just the ones we consider in this paper. Moreover, the functional that recovers the unknown 
edge-length is continuous in the scale parameter distribution when one equips the space of 
distributions with the usual topology of weak convergence. This suggests that if we somehow 
knew the scale parameter distribution up to some small error, then this would constrain the 
errors we could make in determining the edge-length. However, it is not clear how well one 
can identify the relevant features of the scale parameter distribution: the functional that 
recovers the unknown edge-length depends on the functional inverse of the Laplace transform 
of the scale parameter distribution and hence, a priori, on the entirety of the distribution 
rather than some finite dimensional set of features such as the first few moments, and so 
there is an apparent need to estimate the whole distribution quite well. Once again, this is 
not solely a theoretical matter and the extent to which this continuity observation is relevant 
will depend partly on context. 

The rest of the paper is organized as follows. We begin with an introduction of the 
mathematical terms in Section |2l and we present our example in Section IHl We conclude 
with a discussion of the ramifications of this result, and directions for future research in 
Section m 

2. Basics 

In phylogenetic inference, the data are the respective states of an ensemble of characters 
exhibited by each of a collection of taxa. The most commonly used statistical models in 
the area are parameterized by a rooted tree with edge-lengths (which typically represent the 
expected number of times a site changes on the edge when the substitution mechanism is 
in equilibrium) and a set of Markovian stochastic mechanisms for the evolution of succes- 
sive characters down the tree. It is usually assumed that the observed states for different 
characters are statistically independent. The goal of phylogenetic inference is to estimate 
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some or all of: the shape (topology) of the tree, the lengths of the edges, and any unknown 
parameters involved in the specification of the evolution mechanism. 

We will restrict attention to the case where each character has the same finite set of 
possible states. For example, the characters could be nucleotides exhibited at different sites 
on the genome, and so each character is in one of the four states {A, G, C, T}. In the example 
we will give in this paper, we will work with (binary) characters having one of two possible 
states, or 1. For each character c and each edge e in the tree, one then has a rate matrix 
Qce that describes the evolutionary process on edge e for character c (we refer the reader 
unfamiliar with continuous time Markov chains to a standard text such as |GS01j ). Thus, 
given that the character is in state i at the beginning of the edge, the conditional probability 
of the (possibly unobserved) event that it is in state j at the end of the edge is the (i,j) 
entry of the matrix exponential exp{tQe^c), where t is the length of e. The matrix Qc,e has 
row sums equal to and non-negative off-diagonal entries: —Qceihi) is the rate at which 
the character leaves the state i and —Qc,e{hj)/Qc,e{h'i) is the probability that it jumps to 
state j when it leaves state i. 

Single site substitution models can range from the very simple (e.g. the Jukes-Cantor and 
Kimura 2-parameter models) to the very complex (e.g. the General Markov Model), which, 
for a fixed character c, allow the Qe.c matrices to vary significantly from edge to edge, and to 
have many free parameters. However, the variation between the different matrices obtained 
by varying the character c is typically more proscribed. The most complex model is where 
there are no constraints placed on the Qc,e] this is called the "no common mechanism model" 
|TS97j . Under this no common mechanism model, it will clearly be difficult to recover any 
information about edge-lengths. A simple class of models in which it is possible to extract 
information about edge-lengths is the class in which Qc,e is the same for all characters 
c and edges e. Even for this simple model, there is - as is well-known - a certain lack 
of identifiability, because the same probability distribution for the data would arise if the 
common rate matrix was multiplied by a common scale factor and all edge-lengths were 
divided by that same factor. Thus, even for this model one can only hope to make inferences 
about relative edge-lengths unless at least one edge-length is assumed to be known. 

The more commonly used models assume that the different Qc,e matrices are themselves 
the product of a rate matrix specific to the edge e, and a scale factor that is specific to 
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the character c. Thus, the evolutionary process that governs one character is identical, up 
to a scalar multiple, to that governing another character. This is the rates-across-sites 
assumption in molecular phylogenetics, and it has the rather strong implication that if a 
character c is expected to evolve twice as fast on edge e as character c', then c is expected 
to evolve twice as fast on every edge in the tree. 

The assumption of a common rate matrix for all edges is the molecular clock assumption, 
which is known to be untenable in many situations |,TN9n[ lR,ee92j . Perhaps the next simplest 
class of models is the family of rates- across-sites models in which Qc,e is the product of a 
character-specific scale factor and a rate matrix that is common to all characters and edges. 
That is, Qc,e is of the form rcQ. In other words, evolution follows basically the same pattern 
on all lineages for all characters, except that it occurs faster for some characters than others. 

Because of the inferential difficulties of allowing the rates for the different sites to be 
arbitrary, these random scale factors are typically assumed to be drawn from a distribution. 
Of the many possible distributions, the most popular distributions are the two-parameter 
gamma distributions. In fact, in practice, almost all estimations of phylogenetic trees are 
based upon the assumption that the rates across sites are drawn from a gamma distribution, 
or a discretized gamma distribution. Also, it is sometimes assumed that certain characters 
are invariable (that is, that the scale parameter for such sites is 0). 



3. The example 

We will present an example of a tree with three taxa, and with sites evolving under the 
Neyman two-state model (i.e., the two-state version of the Jukes-Cantor model of evolution) 
with a known state at the root. 

Consider a tree with three taxa x, y, and z, a root v, and internal node w that is ancestral 
to X and y. The edges {w, x) and {w, y) have a known length, which we can take as 1. Suppose 
further that the edge {v, w) has unknown length a and that the the tree is clock-like, so that 
the edge {v, z) has length a -|- 1. 

Suppose there are {0, 1}- valued characters labelled 1,2,... that have evolved on this tree. 
The z*^ character evolves according to the Neyman model with rate rj. That is, the transition 
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matrix for an edge of length t is 



1 /(I + exp(-2rit)) (1 - exp(-2rit)) 

2 V(l - exp(-2rit)) (1 + exp(-2rit)) 



) 



) 



say. 



The probabihty distribution for the i^^ character (that is, the marginal likelihood for this 
character) is given as follows. Suppose it is known that the state s„ G {0, 1} is exhibited by 
the root v. Then the probabihty that states s^, Sy, and are exhibited by the taxa x, y, 
and z is 



Assume that successive characters evolve independently. 

The probability distribution for the i^^ character is thus easily seen to be a linear combi- 
nation of the terms 



As one of the referees of this paper remarked, by explicitly writing out the likelihood or using 
Corollary 8.6.6 of |SS03I one can show that only the terms 1, exp(— 4rj), exp(— 2rj(l + a)), 
exp(— 2ri(2 + 2cr)), and exp(— 2rj(3 + a)) actually appear, but we do not need to use this 
fact. 

As described in the Introduction, we will adopt the random effects approach and assume 
that the are, in fact, realizations of a sequence of independent, identically distributed 
random variables that we will denote by (Ai). 

We are interested in finding such a sequence (Ai) and another independent, identically 
distributed sequence (Bi) such that the distribution for the data induced by the random 
choice of scale factors (A^) is the same as that induced by the {Bi) for another choice of 
edge-length t ^ a. 



^ ^ Pa\3vi Sw)Pl\swi Sx)Pl\swi ^y)Pa+l{^vi ^ z) ■ 



s™e{o,i} 
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Wc thus have to find positive random variables A and B with distinct distributions and 
distinct positive constants a and r with the property that 

E[exp(-2A)] = E[exp(-25)] 
E[exp(-4^)] = E[exp(-4S)] 
E[exp(-2(TA)] = E[exp(-2TS)] 



E[exp(-2(3 + 2a)A)] = E[exp(-2(3 + 2t)B)]. 
Take A to have the distribution which has Laplace transform 

E[exp(-cA)] = |n(i + ^^c)-'| (1 + hcr'ii + kcr' 

for positive parameters di, . . . , rfy, h, k. Thus A has the distribution of the sum of 9 indepen- 
dent exponential random variables with respective means di, . . . ,d'^, h, k. Take B to have 
the distribution which has Laplace transform 

E[exp(-C5)] = |n(l + ^iC)-'| (1 +^C)-' 

for positive parameters gi, . . . , qt, i. Thus B has the distribution of the sum of 9 independent 
exponential random variables with respective means gi, . . . , gr, £, £. 
Define maps P : ^ and Q : ^ 

Pi{a,di,...,d7,h,k) = |jj(l + 2(i,)| {l + 2h){l + 2k) 
P2{a,di,...,d7,h,k) = |j](l + 4cii)|(l + 4/i)(l + 4A;) 
P3{a,di,...,d7,h,k) = |jj(l + 2di(7)l {l + 2ha){l + 2ka) 



.1=1 



Pg{a, di,..., d-!, Kk) = \ JJ(1 + 2di(3 + 2a)) \ (1 + 2/i(3 + 2a)){l + 2A;(3 + 2(t)) 



.1=1 
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and 



Q2{r, gi,... 



Q9{r, gi,... 



Qi{r, gi,...,g7,i) 



Qsir, gi,... 



,97^) 



.97 A) 



,97,i) 




l[{l + 2g,T)\{l + 2irY 



Hil + 2g,) \ {1 + 2if 



+ 2(7.(3 + 2r)) \ (1 + 2i{3 + 2r))2. 



i=l 



i=l 



We want to show that P{(J, di, . . . , dj, h, k) = Q{t, gi, . . . , g-j, i) for some choice of parameters 
with a ^ r. 

Write J{a,di, . . . ,di,h,k) for the Jacobian matrix of the mapping {di, . . . ,d7,h,k) i-^ 
P(cr, di, . . . ,di,h, k) (thus, J is a 9 x 9 matrix). Write K{t, gi, . . . , gY, i) for the Jacobian 
matrix of Q. A straightforward check with a computer algebra package such as Mathematica 
shows that the polynomials det J and det K are not identically 0. (While the determinants 
could possibly be computed symbolically, it is easier to compute the matrices symbolically, 
substitute in appropriate integer values for the parameters, and use exact integer arithmetic 
to compute the determinant for those values: For example, det J(2, 3, 4, 5, 6, 7, 8, 9, 10, 11) ^ 
and det 7^(2, 3, 4, 5, 6, 7, 8, 9, 10) 7^ 0.) Because these determinants are polynomials, the 
set of values where J (resp. K) is non-singular is a relatively open subset of M.^ (resp. M^) 
with a closure that is all of M.^ (resp. M^) (that is, they are everywhere dense). 

We can therefore find a point {f,gi,..., ^7, i) in the interior of such that 

(i) the matrix K{f, cji, . . . , cjj, i) is non-singular and 

(ii) in any open neighborhood of {gi, . . . , g^, 1, 1) G there are points (rfi, . . . , 6/7, h, k) 
such that the matrix J(r, (ii, . . . , d-j, h, k) is non-singular. 

By assumption (i) and the implicit function theorem (see, for example, |KP02j ). the range 
of Q contains an open neighborhood of Q{f, gi, . . . ,gi,i). Note that P(r, gi, . . . , gj, 1, 1) = 
Q{f,gi,...,gi,i), and so for all points {a,di, . . . ,d7,h,k) in some open neighbor- 
hood of {f,gi,...,gi,i,i) we can find {t, gi, . . . , gj, i) such that P{a,di, . . . ,dY,h,k) = 



Q{t, gi,... 



,97^)- 
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We will be done if we can show that it is not always the case that a = t for such a 
solution. To see this, we will fix a = f and let (di, . . . , dy, /i, k) vary. By assumption (ii) 
and the implicit function theorem, the image of any open neighborhood of (pi, . . . ,g7,i,i) 
by the map {di, . . . ,dY,h,k) i— > P(f, di, . . . , d^, h, k) has non-empty interior. However, the 
range of the map {gi, . . . , g-^, £) i— > Q{f, gi, . . . , g7,£) is at most 8-dimensional, and, in par- 
ticular, has empty interior. Therefore, there certainly exists {di, . . . ,dY,h,k) such that 
P(f, di,..., dr, h, k) = Q{t, gi, . . . , g7, i) for some (r, gi, . . . , gr, i) with r f . 

Remark. Note that if we take gi — ■ ■ • — gv — then we have a gamma distribution with 
shape parameter 9. Also, we could still produce the unidentifiability phenomenon witnessed 
above if we raised all the Laplace transforms to the same power c > 0. In that case, setting 
gi — ■ ■ ■ = gr = i would give a gamma distribution with shape parameter 9c. Since the 
unidentifiability occurs on a dense set of parameters {gi, . . . , g^, i), any gamma distribution 
will have distributions arbitrarily close to it that exhibit the phenomenon. 

4. Conclusions and Future Research 

The example we have given shows that the attempt to achieve identifiability and reasonable 
inference of edge-lengths in the rates-across-sites model by using random scale factors that 
come from some common distribution can be problematic. 

The gamma distributions 'work', but distributions arbitrarily close to any given gamma 
with smooth, unimodal densities and finite moments of all orders don't. Using the gamma 
family is thus not just a matter of working with distributions that have enough flexibility to 
capture reasonable variation in rates. Rather, identifiability of edge-lengths for the gamma 
family relies on quite specific features of members of that family that are not shared by 
equally reasonable distributions. 

The result has consequences for the estimation of times at internal nodes, since if edge- 
lengths cannot be estimated, then neither can the dates (since the edge-length is a product 
of the elapsed time on the edge, and the equilibrium expected rate of evolution for on that 
edge) . 

Finally, although our main result is theoretical, its consequences can be tested in simu- 
lation. To date, few (if any) such studies have been done that have not presumed that the 
rates are distributed by a gamma distribution, or a distribution consisting of some invariable 
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sites, and the remaining sites evolving under a gamma distribution. This also reflects the 
implicit belief that the assumption of a gamma distribution is acceptable. We hope this 
paper will help encourage researchers to reconsider this assumption. 
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